

### Computer Architecture

Faculty of Computer Science & Engineering - HCMUT



thanhbinh@hcmut.edu.vn

### Arithmetic for Computers

- Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow TAI LIEU SU'U TÂP
- Floating-point real numbers
  - Representation and operations



# Integer Addition

Example: 7 + 6



- Overflow if result out of range suu TÂP
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    - Overflow if result sign is 1
  - Adding two –ve operands
    - Overflow if result sign is 0



# Integer Subtraction

- Add negation of second operand
- Example: 7 6 = 7 + (-6)+7:0000 0000 ... 0000 01110 -6:1111 1111 ... 1111 1010 +1:0000 0000 ... 0000 0001
- Overflow if result out of range suru Tâp
  - Subtracting two +ve or two -ve operands, no overflow
  - Subtracting +ve from –ve operand
    - Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    - Overflow if result sign is 1



# Dealing with Overflow

- Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addui, subu instructions
- Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub-instructions
  - On overflow, invoke exception handler
    - Save PC in exception program counter (EPC) register
    - Jump to predefined handler address
    - mfc0 (move from coprocessor reg) instruction can retrieve EPC value, to return after corrective action



# Multiplication







### Multiplication Hardware





# Optimized Multiplier

- Perform steps in parallel: add/shift
- One cycle per partial-product addition.

That's ok, if frequency of multiplications is low





8/19/2021

# Faster Multiplier

Uses multiple adders

Cost/performance tradeoff

Can be pipelined

Several multiplication performed in parallel





### MIPS Multiplication

- Two 32-bit registers for product
  - HI: most-significant 32-bits/co
  - LO: least-significant 32-bits
- Instructions
  - mult rs, rt / multu rs, \$rt
    - 64-bit product in HI/LQU SUU TẬP
  - mfhi rd / mflo rod + cmut-cncp
    - Move from HI/LO to rd
    - Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    - Least-significant 32 bits of product -> rd



### Division



*n*-bit operands yield *n*-bit quotient and remainder

- Check for 0 divisor
- Long division approach
  - Alf divisor ≤ dividend bits
    - J bit in quotient, subtract
  - Otherwise
    - 0 bit in quotient, bring down next dividend bit
- Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
- Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required



### Division Hardware

Done





8/19/2021

### Optimized Divider

- One cycle per partial-remainder subtraction
- Looks a lot like a multiplier!
  - Same hardware can be used for both





8/19/2021

### Faster Division

- Can't use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
- Faster dividers (e.g. SRT dévision) generate multiple quotient bits per step
  - Still require multiple steps



#### MIPS Division

- Use HI/LO registers for result
   HI: 32-bit remainder

  - LO: 32-bit quotient
- Instructions
  - div rs, rt Aldivirs, rt
  - No overflow or divide-by-0 checking
    - Software must perform checks if required
  - Use mfhi, mflo to access result



# Floating Point

- Representation for non-integral numbers
  - Including very small and very large numbers
- Like scientific notation
  - $-2.34 \times 10^{56}$
  - $-0.002 \times 10^{-4}$
  - +987.02 × 10<sup>9</sup>

BỞI HCMUT-CNCP

normalized

not normalized

- In binary
  - $\pm 1.xxxxxxxx_2 \times 2^{yyyy}$
- Types float and double in C



# Floating Point Standard

- Defined by IEEE Std 754-1985
- Developed in response to divergence of representations
  - Portability issues for scientific code
- Now almost universally adopted
- Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)



### IEEE Floating-Point Format

single: 8 bits single: 23 bits double: 11 bits double: 52 bits

S Exponent Fraction CV

 $x = (-1)^{S} \times (1 + Fraction) \times 2^{(Exponent - Bias)}$ 

- S: sign bit  $(0 \Rightarrow non-negative, 1 \Leftrightarrow negative)$
- Normalize significand: 1.0 ≤ | significand | ≤ 2.0
  - Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
  - Significand is Fraction with the "1." restored
- Exponent: excess representation: actual exponent + Bias
  - Ensures exponent is unsigned
  - Single: Bias = 127; Double: Bias = 1203



# Single-Precision Range

- Exponents 00000000 and 11111111 reserved
- Smallest value
- Smallest value

  Exponent: 00000001

  ⇒ actual exponent = 1 126
  - Fraction:  $000...00 \Rightarrow \text{significand} = 1.0$
  - $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$ TÀI LIỆU SƯU TẬP
- Largest value
  - Exponent: 11111110 ⇒ actual exponent = 254 127 = +127
  - Fraction:  $111...11 \Rightarrow significand \approx 2.0$
  - $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$



### Double-Precision Range

- Exponents 0000...00 and 1111...11 reserved
- Smallest value
  - Exponent: 00000000001 $\Rightarrow$  actual exponent = 1 - 4023 = -4022
  - Fraction:  $000...00 \Rightarrow \text{significand} = 1.0$
  - $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
- Largest value

- BỞI HCMUT-CNCP
- Exponent: 11111111110 ⇒ actual exponent = 2046 – 1023 = +1023
- Fraction:  $111...11 \Rightarrow significand \approx 2.0$
- $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$



# Floating-Point Precision

- Relative precision
  - all fraction bits are significant
  - Single: approximately 2<sup>-23 \*</sup>
    - Equivalent to  $23 \times log_{10} 2 \approx 23 \times 0.3 \approx 6$  decimal digits of precision LIEU SI
  - Double: approximately 2<sup>-52</sup>
    - Equivalent to  $52 \times \log_{10} 2 \approx 52 \times 0.3 \approx 16$  decimal digits of precision



# Floating-Point Example

Represent –0.75

```
-0.75 = (-1)^{1} \times 1.1_{24} \times 2^{4d} \times C_{8}
S = 1
```

- Fraction = 1000...00
- Exponent = -1 + Bias
  - Single: -1 + 127T ★ 126 € 1011 ↑ 1102
  - Double:  $-1 + 1023 = 1022 = 011111111110_2$
- Single: 1011111101000...00
- Double: 101111111101000...00



# Floating-Point Example

What number is represented by the singleprecision float

```
- 1 10000001 0 1000 ... 00 S = 1
```

- Fraction =  $01000...00_2$
- Exponent =  $10000001_2 = 129$

$$= -5.0$$



#### Your turn

Represent –23.0625 in the IEEE 754 floating point format

Convert value of 0x42450000 (IEEE 754) to floating point value, LIÊU SUU TÂP

**B**ổI HCMUT-CNCP



#### Denormal Numbers

■ Exponent =  $000...0 \Rightarrow$  hidden bit is 0

$$x = (-1)^{S} \times (0 + Fraction) \times 2^{-Bias}$$

- Smaller than normal numbers
  - allow for gradual underflow, with diminishing precision
- Denormal with fraction = 000...0

$$x = (-1)^{S} \times (0+0) \times 2^{-Bias} = \pm 0.0$$

Two representations of 0.0!



#### Infinities and NaNs

- Exponent = 111...1, Fraction = 000...0
  - ±Infinity
  - Can be used in subsequent calculations, avoiding need for overflow checken
- Exponent = 111...1, Fraction ≠ 000...0
  - Not-a-Number (NaN) HEMUT-CNCP
  - Indicates illegal or undefined result
    - e.g., 0.0 / 0.0
  - Can be used in subsequent calculations



# Floating-Point Addition

- Consider a 4-digit decimal example
- 9.999 × 10<sup>1</sup> + 1.610 × 10<sup>-1</sup> HOACNC
   1. Align decimal points
  - Shift number with smaller exponent
  - $9.999 \times 10^{1} + 0.016 \times 10^{1}$
- 2. Add significands
   9.999 × 10¹ + 0.016 × 10¹ = 10.015 × 10¹
- 3. Normalize result & check for over/underflow
  - $1.0015 \times 10^2$
- 4. Round and renormalize if necessary
  - $1.002 \times 10^{2}$



# Floating-Point Addition

- Now consider a 4-digit binary example
- $1.000_2 \times 2^{-1} + -1.110_2 \times 2^{-2}$  1. Align binary points
- - Shift number with smaller exponent
  - $1.000_2 \times 2^{-1} + -0.111_2 \times 2^{-1}$
- 2. Add significands
  - $1.000_{2} \times 2^{-1} + -0.111_{2} \times 2^{+1} = 0.001_{2} \times 2^{-1}$
- 3. Normalize result & check for over/underflow
  - $1.000_2 \times 2^{-4}$ , with no over/underflow
- 4. Round and renormalize if necessary
  - $1.000_2 \times 2^{-4}$  (no change) = **0.0625**



#### FP Adder Hardware

- Much more complex than integer adder
- Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
- FP adder usually takes several cycles
  - Can be pipelined



### FP Adder Hardware





# Floating-Point Multiplication

- Consider a 4-digit decimal example
  - $\bullet$  1.110 × 10<sup>10</sup> × 9.200 × 10<sup>-5</sup>
- 1. Add exponents
  - For biased exponents, subtract bias from sum
  - New exponent = 10 + -5 = 500
- 2. Multiply significands
  - $1.110 \times 9.200 = 10.212 \Rightarrow 10.212 \times 10^{5}$
- 3. Normalize result & check for over/underflow?
  - 1.0212 × 10<sup>6</sup>

- **BỞI HCMUT-CNCP**
- 4. Round and renormalize if necessary
  - 1.021 × 10<sup>6</sup>
- 5. Determine sign of result from signs of operands
  - +1.021 × 10<sup>6</sup>



# Floating-Point Multiplication

- Now consider a 4-digit binary example
  - $1.000_2 \times 2^{-1} \times -1.110_2 \times 2^{-2}$  $(0.5 \times -0.4375)$
- 1. Add exponents
   Unbiased: -1 + -2 = -3
   Biased: (-1 + 127) + (-2 + 127) = -3 + 254 127 = -3 + 127
- 2. Multiply significands
  - $1.000_2 \times 1.110_2 = 1.1102 \implies 1.1102 \times 2^{-3}$
- 3. Normalize result & check for over underflow?
  - $1.110_2 \times 2^{-3}$  (no change) with no over/underflow
- 4. Round and renormalize if necessary
  - $1.110_2 \times 2^{-3}$  (no change)
- 5. Determine sign: +ve  $\times$  -ve  $\Rightarrow$  -ve
  - $-1.110_2 \times 2^{-3} = -0.21875$



#### FP Arithmetic Hardware

- FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
- FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-rootmur-ence
  - FP ↔ integer conversion
- Operations usually takes several cycles
  - Can be pipelined



#### FP Instructions in MIPS

- FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
- Separate FP registers
  - 32 single-precision: \$f0,\$f1, c. \$f31
  - Paired for double-precision: \$f0/\$f1, \$f2/\$f3, ...
    - Release 2 of MIPs ISA supports 32 × 64-bit FP reg's
- FP instructions operate only for FP registers
  - Programs generally don't do integer ops on FP data, or vice versa
  - More registers with minimal code-size impact
- FP load and store instructions
  - Lwc1, ldc1, swc1, sdc1
    - e.g., ldc1 \$f8, 32(\$sp)



#### FP Instructions in MIPS

Single-precision arithmetic

```
    add.s, sub.s, mul.s, div.s
    e.g., add.s $f0, $f1, $f6

Double-precision arithmetic
```

- Double-precision arithmetic
  - add.d, sub.d, mull.d
    - e.g., mul.d \$f4, \$f4, \$
- Single- and double-precision comparison
  - C.xx.s, c.xx.d (xx is eq, It, Ie, ... Lieu Suu
  - Sets or clears FP condition-code bit
    - e.g. c.lt.s \$f3, \$f4
- Branch on FP condition code true or false
  - bc1t , bc1f
    - e.g.,bclt TargetLabel



### FP Example: °F to °C

C code: float f2c (float fahr) { return ((5.0/9.0)\*(fahr fahr in \$f12, result in \$f0, literals in global memory space Compiled MIPS code: f2c: **lwc1** \$f16, const5(\$gp) 1wc2 \$f18, const9(\$gp)TÀI LIỆU SƯU TÂP BỞI HCMUT-CNCP div.s \$f16, \$f16, \$f18 lwc1 \$f18, const32(\$gp) sub.s \$f18, \$f12, \$f18 mul.s \$f0, \$f16, \$f18 jr \$ra



# FP Example: Array Multiplication

```
X = X + Y \times Z

    All 32 × 32 matrices, 64-bit double-precision elements

   C code:
void mm ( double x[][], double y[][]
             double z[][]) {
    int i, j, k;
    for (i = 0; i! = 32; i = i + 1)
       for (j = 0; j! = 32; jTAI+LIÊUSUUTÂP)
           for (k = 0; k! = 32; k^{B \stackrel{?}{=} 1} k^{CM} + 1)^{NCP}
               x[i][j] = x[i][j]
                          + y[i][k] * z[k][j];
        Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2
```



# FP Example: Array Multiplication

#### MIPS code:

```
li $t1, 32 # $t1 = 32 (row size/loop end)
    li $$s0, 0 # i = 0; initialize 1st for loop
L1: li $s1, 0 # j = 0; restart 2nd for loop
L2: li $s2, 0  # k = 0; restart 3rd for loop
    sll $t2, $s0, 5 # $t2 = i * 32 (size of row of x)
    addu $t2, $t2, $s1 # $t2 = i * size(row) + j
    sll $t2, $t2, 3 # $t2 = byte offset of [i] [i] [i]
    addu $t2, $a0, $t2 # $t2 = byte addressorf x[i][j]
    1.d $f4, 0($t2) # $f4 = 8 bytes of x[i][j]
L3: sll $t0, $s2, 5 # $t0 = k * 32 (size of row of z)
    addu $t0, $t0, $s1 \# $t0 = k * size(row) + j
    sll $t0, $t0, 3 # $t0 = byte offset of [k][j]
    addu $t0, $a2, $t0 \# $t0 = byte address of z[k][j]
    1.d $f16, 0($t0) # $f16 = 8 bytes of z[k][i] ...
```

## FP Example: Array Multiplication

```
$t0, $s0, 5
             # $t0 = i*32 (size of row of y)
sll
    $t0, $t0, $s2  # $t0 = i*size(row) + k
addu
    $t0, $t0, 3  # $t0 = byte offset of [i][k]
sll
addu $t0, $a1, $t0  # $t0 = byte address of y[i][k
   1.d
mul.d $f16, $f18, $f16 # $f16 = y[i][k] * z[k][j]
add.d \$f4, \$f4, \$f16 # f4=x[i][j] + y[i][k]*z[k]
                                TÀI LIÊU SƯU TẬP
addiu $s2, $s2, 1 # $k k + 1
   $s2, $t1, L3  # if (k != 32) go to L3
bne
s.d
   addiu $s1, $s1, 1 # $j = j + 1
   $s1, $t1, L2 # if (j != 32) go to L2
bne
addiu $s0, $s0, 1 \# $i = i + 1
   $s0, $t1, L1 # if (i != 32) go to L1
bne
```



#### Accurate Arithmetic

- IEEE Std 754 specifies additional rounding control
  - Extra bits of precision (guard, round, sticky)
  - Choice of rounding modes
  - Allows programmer to fine-tune numerical behavior of a computation
- Not all FP units implement all options
  - Most programming languages and FP libraries just use defaults
- Trade-off between hardware complexity, performance, and market requirements



#### Subword Parallellism

- Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
  - Example: 128-bit adder:
    - Sixteen 8-bit adds
    - Eight 16-bit adds LIỆU SƯU TẬP
    - Four 32-bit adds Both CMUT-CNCP
- Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)



#### x86 FP Architecture

- Originally based on 8087 FP coprocessor
  - 8 × 80-bit extended-precision registers
  - Used as a push-down stack
  - Registers indexed from TOS: ST(0), ST(1), ...
- FP values are 32-bit or 64 in memory
  - Converted on load/store of memory operand
  - Integer operands can also be converted on load/store
- Very difficult to generate and optimize code
  - Result: poor FP performance



#### x86 FP Instructions

Optional variations

I: integer operand

P: pop operand from stack

R: reverse operand order

But not all combinations allowed

| Data transfer   | Arithmetic                         | Compare    | Transcendental |
|-----------------|------------------------------------|------------|----------------|
| FILD mem/ST(i)  | FIADDP mem7\$T(i)LI                | PICOMPUTÂP | FPATAN         |
| FISTP mem/ST(i) | FISUBRP mem/ST(i) <sup>B ở i</sup> | FUCOMP     | F2XMI          |
| FLDPI           | FIMULP mem/ST(i)                   | FSTSW      | FCOS           |
| FLD1            | FIDIVRP mem/ST(i)                  | AX/mem     | FPTAN          |
| FLDZ            | FSORT                              |            | FPREM          |
|                 | FABS<br>FRNDINT                    |            | FPSIN          |
|                 |                                    |            | FYL2X          |



#### Streaming SIMD Extension 2 (SSE2)

- Adds 4 × 128-bit registers
  - Extended to 8 registers in AMD64/EM64T
- Can be used for multiple FP operands
  - 2 × 64-bit double precision
  - 4 × 32-bit double precision
  - Instructions operate on them simultaneously
    - Single-Instruction Multiple-Data



Unoptimized code:

```
void dgemm (int n, double* A, double* B, double* C) {
       for (int i = 0; i < n; ++i) {
3.
          for (int j = 0; j < n;
             /* cij = C[i][j]^{4}*/
             double cij = C[i+j*n]
5.
6.
             for (int k = 0; k < n; k++
                /* cij += A[i][k]*B[k]
8.
             C[i+j*n] = cij; /* C[i][j]
9.
10.
11.
12. }
```



#### x86 assembly code:

```
1. vmovsd (%r10), %xmm0 # Load 1 element of C into %xmm0
2. mov %rsi,%rcx # register %rcx = %rsi
3. xor %eax, %eax # register %eax = 0
4. vmovsd (%rcx), %xmm1 # Load 1 element of B into %xmm1
5. add %r9,%rcx # register %rcx = %rcx + %r9
6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1, element of A
                             register %rax = %rax + 1
7. add $0x1,%rax
                           # compare %eax to %edi
8. cmp %eax, %edi
9. vaddsd %xmm1, %xmm0, %xmm0 # add %xmm1, %xmm0
10. jq 30 \langle dqemm+0x30\rangle # jump if %eax > %edi
11. add \$0x1,\$r11d # register \$r11 = \$r11 + 1
12. vmovsd %xmm0, (%r10) # Store %xmm0 into C element
```



#### Optimized C code:

```
1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)
3. {
      for ( int i = 0; i < n; i+=4 )
         for ( int j = 0; j < n; j++ )</pre>
5.
            _{m256d} c0 = _{mm256} load pd(C+i+j*n)
6.
            /* c0 = C[i][i] */
            for( int k = 0; k < n; k++ ) TAI LIÊU SƯU TÂP</pre>
7.
8.
                c0 = mm256 \text{ add pd(}c0,
                                                 BổI HCMUT-CNCP
                /* c0 += A[i][k]*B[k][j] */
9.
            mm256 \text{ mul pd} (mm256 \text{ load pd} (A+i+k*n),
            mm256 broadcast sd(B+k+j*n)));
10.
            mm256 store pd(C+i+j*n, c0); /* C[i][j] = c0 */
11.
12.
13. }
```



#### Optimized x86 assembly code:

```
1. vmovapd (%r11), %ymm0 # Load 4 elements of C into %ymm0
2. mov %rbx, %rcx # register %rcx = %rbx
3. xor %eax, %eax # register %eax =
4. vbroadcastsd (%rax, %r8,1), %ymm1 # Make 4 copies of B element
5. add \$0x8,\$rax \# register \$rax = \$rax + 8
6. vmulpd (%rcx), %ymm1, %ymm1 # Parallel mul %ymm1, 4 A elements
7. add %r9,%rcx # register %rcx = %rcx + %r9
8. cmp %r10, %rax # compare %r10 to %rax
9. vaddpd %ymm1, %ymm0, %ymm0 # Parallel add %ymm1, %ymm0
10. jne 50 <dgemm+0x50> # jump if not %r10 != %rax
11. add \$0x1,\$esi # register \$ esi = \$ esi + 1
12. vmovapd %ymm0, (%r11) # Store %ymm0 into 4 C elements
```



## Right Shift and Division

- Left shift by i places multiplies an integer by 2<sup>i</sup>
- Right shift divides by 21? \*\*
  - Only for unsigned integers
- For signed integers
  - Arithmetic right shift: replicate the sign bit
  - e.g., -5 / 4
    - $\blacksquare$  11111011<sub>2</sub> >> 2 = 11111110<sub>2</sub> = -2
    - Rounds toward -∞
  - c.f.  $111111011_2 >>> 2 = 001111110_2 = +62$



## Associativity

Parallel programs may interleave operations in unexpected orders

Assumptions of associativity may fail

|   |           | (x+y)+z          | x+(y+z)        | 3    |
|---|-----------|------------------|----------------|------|
| X | -1.50E+38 |                  | -1.50E+38      |      |
| у | 1.50E+38  | 0.00E+0 <u>0</u> | À <del>-</del> |      |
| Z | 1.0       | 1.0              | 1.50E+38       | JIĄP |
|   |           | 1.00E+00         | 0.00E+00       | 01   |

 Need to validate parallel programs under varying degrees of parallelism



## Who Cares About FP Accuracy?

- Important for scientific code
  - But for everyday consumer use?
    - "My bank balance is out by 0.0002¢!" ⊗
- The Intel Pentium FDIV bug
  - The market expects accuracy
  - See Colwell, The Pentium Chronicles



# Concluding Remarks

- Bits have no inherent meaning
  - Interpretation depends on the instructions applied
- Computer representations of numbers
  - Finite range and precision
  - Need to account for this in programs
- ISAs support arithmetic
  - Signed and unsigned integers
  - Floating-point approximation to reals SUU TÂP
- Bounded range and precision BÖI HCMUT-CNCP
  - Operations can overflow and underflow
- MIPS ISA
  - Core instructions: 54 most frequently used
    - 100% of SPECINT, 97% of SPECFP
  - Other instructions: less frequent

